Feature Engineering

After cleaning and preprocessing the data, we created the relevant and important features for forecasting.

Lags

The previous month's data is the lag for the current month. For example, if you want to forecast for 2023 June, the sales data of 2023 May will be the first lag, 2023 April will be the second lag, and so on. The assumption is that next month's sales depend in some way on the sales pattern of the last few months.

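import pandas as pd

# The function name is illustrative; the original snippet shows only the body.
def add_lags(df, n_lags):
    final = pd.DataFrame()
    for brand in df["brand"].unique():
        # Copy the slice so that adding columns does not touch (or warn about) df
        temp = df[df["brand"] == brand].copy()
        for i in range(1, n_lags + 1):
            # lag_i holds the quantity delivered i months earlier for this brand
            temp[f'lag_{i}'] = temp["quantity_delivered"].shift(i)
        final = pd.concat([final, temp])
    # The first n_lags rows of each brand have no history; fill them with 0
    final.fillna(0, inplace=True)
    return final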
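To see what the lag columns contain, here is a minimal sketch on toy data. It uses `groupby(...).shift(...)` as an alternative to the explicit per-brand loop. The brand names, dates and quantities are invented for illustration; only the column names `brand` and `quantity_delivered` come from the snippet above.

```python
import pandas as pd

# Toy monthly data for two hypothetical brands (all values are invented).
toy = pd.DataFrame(
    {
        "date": list(pd.date_range("2023-01-01", periods=4, freq="MS")) * 2,
        "brand": ["A"] * 4 + ["B"] * 4,
        "quantity_delivered": [10, 20, 30, 40, 5, 6, 7, 8],
    }
)

n_lags = 2
for i in range(1, n_lags + 1):
    # Shift within each brand so one brand's history never leaks into another's.
    toy[f"lag_{i}"] = toy.groupby("brand")["quantity_delivered"].shift(i)

# Lags with no history at the start of each brand are filled with 0.
toy = toy.fillna(0)
print(toy)
```

For brand A, `lag_1` reads 0, 10, 20, 30: each month sees the previous month's quantity, and the first month has no history.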

These lags are filled autoregressively. This means the forecast for 2023 December needs the sales data of 2023 November as its first lag, but 2023 November itself has to be forecast, since we don't have actual data for it.

So, the forecast for 2023 November will be taken as the first lag for 2023 December. Similarly, the forecast of 2023 December will be taken as first lag of 2024 January and forecast of 2023 November will be taken as second lag of 2024 January. And so on.

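from dateutil.relativedelta import relativedelta
# Lag columns of the month being forecast: lag_1 ... lag_n
a = [f"lag_{i}" for i in range(1, lags + 1)]
# Source columns in the previous month's row: its forecast, then lag_1 ... lag_(n-1)
b = ["forecast"] + a[:-1]
df.loc[(df.index == time), a] = df.loc[(df.index == time - relativedelta(months=1)), b].values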
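The roll-forward can be sketched end to end as follows. This is only an illustration under stated assumptions: the data frame, its values and `toy_model` (a stand-in that averages the lags) are hypothetical and replace whatever trained model is actually used; only the `forecast` and `lag_i` column names and the shifting logic follow the snippet above.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

lags = 3
lag_cols = [f"lag_{i}" for i in range(1, lags + 1)]

# Hypothetical frame: one row per month, actuals known up to 2023 October.
idx = pd.date_range("2023-08-01", "2024-01-01", freq="MS")
df = pd.DataFrame(0.0, index=idx, columns=["forecast"] + lag_cols)
# Assumption: for past months the "forecast" column simply holds actual sales.
df.loc["2023-08-01":"2023-10-01", "forecast"] = [100.0, 110.0, 120.0]
df.loc[pd.Timestamp("2023-10-01"), lag_cols] = [110.0, 100.0, 0.0]

def toy_model(row):
    # Stand-in for a trained model: the mean of the lag features.
    return float(row[lag_cols].mean())

for time in pd.date_range("2023-11-01", "2024-01-01", freq="MS"):
    prev = time - relativedelta(months=1)
    # Previous month's forecast becomes lag_1, its lag_1 becomes lag_2, and so on.
    df.loc[time, lag_cols] = df.loc[prev, ["forecast"] + lag_cols[:-1]].values
    df.loc[time, "forecast"] = toy_model(df.loc[time])

print(df)
```

After the loop, the 2023 November forecast appears as `lag_1` of 2023 December and as `lag_2` of 2024 January, which is the pattern described above.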

Seasonal features

There are two kinds of seasonal features - trend and seasonality.

Trend is the overall movement of the data over a period of time. If a time series keeps fluctuating over a period of time (say 4 months) but has an overall positive movement, it will have a positive trend. In other words, the slope of the linear regression line fit over the time series for that period will be its trend for the period.
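As a small illustration of trend as a regression slope, the sketch below fits a straight line to four months of invented sales figures with `np.polyfit`. The numbers are hypothetical and chosen only to show a series that fluctuates but rises overall.

```python
import numpy as np

# Hypothetical sales for four consecutive months: fluctuating but rising overall.
sales = np.array([100.0, 90.0, 120.0, 115.0])
months = np.arange(len(sales))  # 0, 1, 2, 3

# Least-squares fit of sales = slope * month + intercept.
slope, intercept = np.polyfit(months, sales, deg=1)
print(f"trend (units per month): {slope:.1f}")
```

The second month dips below the first, yet the fitted slope is positive, so the period as a whole has a positive trend.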

Seasonality is the periodic fluctuation of the data. If a time series regularly peaks in summer and bottoms out in winter, it has an annual seasonality, which can be estimated with statistical methods such as seasonal decomposition.

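There are two ways to get seasonal features. One estimates trend and seasonality from the data itself by statistical decomposition (seasonal_decompose below). The other is deterministic and builds trend and Fourier terms from the time index alone (DeterministicProcess below).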

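from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.deterministic import DeterministicProcess

# Statistical decomposition, fitted only on data before 2023
df_temp = df[df.index < "2023-01-01"]
decomposed = seasonal_decompose(df_temp["quantity_delivered"], period=12, extrapolate_trend=1)
df['trend_1'], df['seasonal'] = decomposed.trend, decomposed.seasonal
# Deterministic trend and Fourier terms, built from the index alone
features = DeterministicProcess(df.index, order=order, fourier=fourier).in_sample()
df = pd.concat([df, features], axis=1)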
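To make the deterministic option more concrete, the sketch below builds a linear trend and Fourier terms by hand with NumPy. It is not meant to reproduce the exact columns `DeterministicProcess` returns; the column names, the date range and the choice of two Fourier pairs are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly index covering three years.
index = pd.date_range("2021-01-01", periods=36, freq="MS")
t = np.arange(len(index))  # 0, 1, 2, ... one step per month
period, fourier_order = 12, 2  # annual cycle, two sine/cosine pairs (assumed)

features = pd.DataFrame({"trend": t + 1}, index=index)
for k in range(1, fourier_order + 1):
    # The k-th pair completes k full cycles per year.
    features[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
    features[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)

print(features.head())
```

Because these columns depend only on the position in time, they are known for any future month and never need to be forecast, unlike the decomposition-based trend and seasonality.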

The trend and seasonality obtained by decomposition have to be recalculated every time a new forecast is made:

decomposed = seasonal_decompose(df_new.forecast, period=12, extrapolate_trend=1)
trend_1, seasonal = decomposed.trend, decomposed.seasonal
df_1 = pd.DataFrame({'trend_1': trend_1, 'seasonal': seasonal})
df_new = pd.concat([df_new, df_1], axis=1)
df_new.set_index("date", inplace=True)
df.loc[(df.index == time), ["trend_1", "seasonal"]] = df_new.loc[(df_new.index == time), ["trend_1", "seasonal"]].values

Time features

The year, month and quarter of the time period can also be important features for monthly forecasting. For daily forecasting, there can also be week, day, etc.

df.reset_index(inplace=True)
df['year'] = df["date"].dt.year
df['month'] = df["date"].dt.month
df['quarter'] = df["date"].dt.quarter
df.set_index("date", inplace=True)
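For daily data, the same idea extends to finer-grained fields. The sketch below is a hypothetical example on an invented ten-day index; the column names `day`, `day_of_week` and `week` are assumptions, not names from the original pipeline.

```python
import pandas as pd

# Hypothetical daily index (no sales values needed to show the features).
daily = pd.DataFrame(
    index=pd.date_range("2023-06-01", periods=10, freq="D", name="date")
)
daily["day"] = daily.index.day
daily["day_of_week"] = daily.index.dayofweek  # Monday=0 ... Sunday=6
daily["week"] = daily.index.isocalendar().week.astype(int)  # ISO week number

print(daily.head())
```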

Cyclic features

The 1st day of a month is actually close to the 30th day of the previous month, but the numbers say otherwise. Similarly, January is close to December, but the ordinal encoding says otherwise.

Due to this, it makes sense to have cyclic time features given by sine and cosine waves instead of linear time features.

import numpy as np

df.reset_index(inplace=True)
# Map month 1..12 onto a circle so that December and January end up adjacent
df['month_sin'] = np.sin(2 * np.pi * df["date"].dt.month / 12)
df['month_cos'] = np.cos(2 * np.pi * df["date"].dt.month / 12)
df.set_index("date", inplace=True)
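A quick way to check that the encoding does what is intended is to compare distances between months in the (sine, cosine) plane. The helper functions below are hypothetical names introduced only for this check.

```python
import numpy as np

def month_to_xy(month):
    # Hypothetical helper: place month 1..12 on the unit circle.
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def cyclic_distance(m1, m2):
    # Straight-line distance between two months in (sin, cos) space.
    return float(np.linalg.norm(month_to_xy(m1) - month_to_xy(m2)))

dec_jan = cyclic_distance(12, 1)
jan_feb = cyclic_distance(1, 2)
jan_jul = cyclic_distance(1, 7)
print(dec_jan, jan_feb, jan_jul)
```

December to January comes out exactly as close as January to February, while January and July sit on opposite sides of the circle. With the ordinal encoding, December and January would instead be 11 units apart.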

There used to be another feature, season, which denoted the agricultural season (Rabi or Kharif), since this was sales forecasting for agricultural products. It applied to only a few specific countries (India) and was therefore abandoned.

The set of features that gives the best forecasts differs for every country and brand. For some, seasonal features might be unnecessary, while for others, time features might be detrimental. Thus, every country and brand is treated on a case-by-case basis.